Sparse Join

Consider a table join where only a small percentage of the rows in one of the tables is required to satisfy the join.

For example, consider a data warehouse for a large organisation whose customer table contains 1 million customers. Any customer-based report generated from this warehouse would need to join an even larger fact table to the customer table, but given that most such reports would be reporting on a reduced set of fact data (eg. for a single region, or date), it is unlikely that any one report would need to read a large proportion of the customer table.

With a sparse join involving two large tables, the two most common join methods provide sub-optimal performance:

Sparse joins fall into two distinct categories, each with a separate solution:

  1. Where the reduced set of rows in the sparse table is determined by a column in the sparse table. eg. Customers in a particular region.

  2. Where the reduced set of rows in the sparse table is determined by a column in the other table. eg. All customers who placed an order in September.

Both of these solutions involve radical changes to the structure of the sparse table. They are not so much a solution to a tuning problem as a design alternative that should be implemented when a system is first being built. Most importantly, no table should be either clustered or partitioned by a developer; in a live system this should only be performed by an experienced DBA after full regression testing in a test database.


©Copyright 2003